Abstract: Estimation and prediction problems for dense signals are often framed in
terms of minimax problems over highly symmetric parameter spaces. In this paper, we
study minimax problems over $\ell^2$-balls for high-dimensional linear models with Gaussian
predictors. We obtain sharp asymptotics for the minimax risk that are applicable in
any asymptotic setting where the number of predictors diverges and prove that ridge
regression is asymptotically minimax. Adaptive asymptotic minimax ridge estimators
are also identified. Orthogonal invariance is heavily exploited throughout the paper
and, beyond serving as a technical tool, provides additional insight into the problems
considered here. Most of our results follow from an apparently novel analysis of an
equivalent non-Gaussian sequence model with orthogonally invariant errors. As with
many dense estimation and prediction problems, the minimax risk studied here has rate
$d/n$, where $d$ is the number of predictors and $n$ is the number of observations; however,
when $d \asymp n$ the minimax risk is influenced by the spectral distribution of the predictors
and is notably different from the linear minimax risk for the Gaussian sequence model
(Pinsker, 1980) that often appears in other dense estimation and prediction problems.

1. Introduction

This paper is about estimation and prediction problems involving non-sparse (or "dense")
signals in high-dimensional linear models. By contrast, a great deal of recent research into
high-dimensional linear models has focused on sparsity. Though there are many notions of
sparsity (e.g. $\ell^p$-sparsity (Abramovich et al., 2006)), a vector $\beta \in \mathbb{R}^d$ is typically
considered to be sparse if many of its coordinates are very close to 0. One of the general
principles that has emerged from the literature on sparse high-dimensional linear models
may be summarized as follows: if the parameter of interest is sparse, then this can often be
leveraged to develop methods that perform very well, even when the number of predictors
is much larger than the number of observations. Indeed, powerful theoretical performance
guarantees are available for many methods developed under this paradigm, provided the
parameter of interest is sparse (Bickel et al., 2009; Bunea et al., 2007; Candes and Tao, 2007;
Fan and Lv, 2011; Rigollet and Tsybakov, 2011; Zhang, 2010). Furthermore, in many
applications - especially in engineering and signal processing - sparsity assumptions have been
repeatedly validated (Donoho, 1995; Duarte et al., 2008; Erlich et al., 2010; Lustig et al., 2007;
Wright et al., 2008). However, there is less certainty about the manifestations of sparsity in
other important applications where high-dimensional data is abundant. For example,
several recent papers have questioned the degree of sparsity in modern genomic datasets (see,
for instance, (Hall et al., 2009), and the references contained therein - including (Goldstein,
2009; Hirschhorn, 2009; Kraft and Hunter, 2009) - and, more recently, (Bansal et al., 2010;
Manolio, 2010)). In situations like these, sparse methods may be sub-optimal and methods
designed for dense problems may be more appropriate.

Let $d$ and $n$ denote the number of predictors and observations, respectively, in a linear
regression problem. In dense estimation and prediction problems, where the parameter of
interest is not assumed to be sparse, $d/n \to 0$ is often required to ensure consistency. Indeed,
this is the case for the problems considered in this paper. In this sense, dense problems are
more challenging than sparse problems, where consistency may be possible when $d/n \to \infty$.
This lends credence to Friedman et al.'s (2004) "bet on sparsity" principle for high-dimensional
data analysis:

Use a procedure that does well in sparse problems, since no procedure does well in dense problems. 

The "bet on sparsity" principle has proven to be very useful, especially in applications where 
sparsity prevails, and it may help to explain some of the recent emphasis on understanding 
sparse problems. However, the emergence of important problems in high-dimensional data
analysis where the role of sparsity is less clear highlights the importance of characterizing
and thoroughly understanding dense problems in high-dimensional data analysis. This paper
addresses some of these problems. 

Minimax problems over highly symmetric parameter spaces have often been equated with
dense estimation problems in many statistical settings (Donoho and Johnstone, 1994; Johnstone,
2011). In this paper, we study the minimax risk over $\ell^2$-balls for high-dimensional linear
models with Gaussian predictors. We identify several informative, asymptotically equivalent
formulations of the problem and provide a complete asymptotic solution when the number of
predictors $d$ grows large. In particular, we obtain sharp asymptotics for the minimax risk that
are applicable in any asymptotic setting where $d \to \infty$ and we show that ridge regression
estimators (Hoerl and Kennard, 1970; Tikhonov, 1943) are asymptotically minimax. Adaptive
asymptotic minimax ridge estimators are also discussed. Our results follow from carefully
analyzing an equivalent non-Gaussian sequence model with orthogonally invariant errors and the
novel use of two classical tools - Brown's identity (Brown, 1971) and Stam's inequality (Stam,
1959) - to relate this sequence model to the Gaussian sequence model with iid errors. The
results in this paper share some similarities with those found in (Goldenshluger and Tsybakov,
2001, 2003), which address minimax prediction over $\ell^2$-ellipsoids. However, the implications
of our results and the methods used to prove them differ substantially from Goldenshluger
and Tsybakov's (this is discussed in more detail in Sections 2.2-2.3 below).

2. Background and preliminaries 

2.1. Statistical setting 

Suppose that the observed data consists of outcomes $y_1, \dots, y_n \in \mathbb{R}$ and $d$-dimensional
predictors $x_1, \dots, x_n \in \mathbb{R}^d$. The outcomes and predictors follow a linear model and are
related via the equation

$$ y_i = x_i^T\beta + \epsilon_i, \quad i = 1, \dots, n, \tag{1} $$

where $\beta = (\beta_1, \dots, \beta_d)^T \in \mathbb{R}^d$ is an unknown parameter vector (also referred to as "the
signal") and $\epsilon_1, \dots, \epsilon_n$ are unobserved errors. To simplify notation, let $y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$,
$X = (x_1, \dots, x_n)^T$, and $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$. Then (1) may be rewritten as $y = X\beta + \epsilon$. In many
high-dimensional settings it is natural to consider the predictors $x_i$ to be random. In this
paper, we assume that

$$ x_1, \dots, x_n \sim N(0, I) \quad \text{and} \quad \epsilon_1, \dots, \epsilon_n \sim N(0, 1) \tag{2} $$

are independent, where $I = I_d$ is the $d \times d$ identity matrix. These distributional assumptions
impose significant additional structure on the linear model (1). However, similar models have 
been studied previously (Baranchik, 1973; Breiman and Freedman, 1983; Brown, 1990; Leeb, 
2009; Stein, 1960) and we believe that the insights imparted by the resulting simplifications 
are worthwhile. For the results in this paper, perhaps the most noteworthy simplifying con- 
sequence of the normality assumption (2) is that the distributions of X and e are invariant 
under orthogonal transformations. 
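
As an illustration of the model (1)-(2), the following minimal sketch draws a single dataset. The function name `simulate_linear_model`, the seed handling, and the example dimensions are assumptions made for illustration only; they are not part of the paper.

```python
import numpy as np


def simulate_linear_model(n, d, beta, seed=0):
    """Draw (y, X, eps) from y = X beta + eps with iid N(0, 1) entries in X and eps."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))   # rows x_i ~ N(0, I_d)
    eps = rng.standard_normal(n)      # errors eps_i ~ N(0, 1)
    y = X @ beta + eps
    return y, X, eps


if __name__ == "__main__":
    d, n = 20, 50
    beta = np.ones(d) / np.sqrt(d)    # a dense signal with ||beta|| = 1 (illustrative choice)
    y, X, eps = simulate_linear_model(n, d, beta)
    print(y.shape, X.shape)
```

Every coordinate of `beta` is nonzero here, which is the kind of dense signal the paper has in mind; nothing in the sketch depends on that particular choice.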

We point out that the assumption $E(x_i) = 0$ (which is implicit in (2)) is not particularly
limiting: if $E(x_i) \neq 0$, then we can reduce to the mean 0 case by centering and decorrelating
the data. If $\mathrm{Var}(\epsilon_i) = \sigma^2 \neq 1$ and $\sigma^2$ is known, then this can easily be reduced to the case
where $\mathrm{Var}(\epsilon_i) = 1$. If $\sigma^2$ is unknown and $d < n$, then $\sigma^2$ can be effectively estimated and
one can reduce to the case where $\mathrm{Var}(\epsilon_i) = 1$ (Dicker, 2012). We conjecture that $\sigma^2$ can be
effectively estimated when $d > n$, provided $\sup d/n < \infty$ (for sparse $\beta$, Sun and Zhang (2011)
and Fan et al. (2012) have shown that $\sigma^2$ can be estimated when $d \gg n$). Dicker (2012) has
discussed the implications if $\mathrm{Cov}(x_i) = \Sigma \neq I$. Essentially, when the emphasis is prediction
and non-sparse signals, if a norm-consistent estimator for $\mathrm{Cov}(x_i) = \Sigma$ is available, then
it is possible to reduce to the case where $\mathrm{Cov}(x_i) = I$; if a norm-consistent estimator is not
available, then limitations arise; however, these limitations may not be overly restrictive (this
is discussed further in Section 3.2 below).

Let $\|\cdot\| = \|\cdot\|_2$ denote the $\ell^2$-norm. In this paper we study the performance of estimators
$\hat\beta$ for $\beta$ with respect to the risk function

$$ R(\hat\beta, \beta) = R_{d,n}(\hat\beta, \beta) = E_\beta\|\hat\beta - \beta\|^2, \tag{3} $$

where the expectation is taken over $(\epsilon, X)$ and the subscript $\beta$ in $E_\beta$ indicates that $y = X\beta + \epsilon$
(below, for expectations that do not involve $y$, we will often omit this subscript). We emphasize
that the expectation in (3) is taken over the predictors $X$ as well as the errors $\epsilon$, i.e. it is
not conditional on $X$. The risk $R(\hat\beta, \beta)$ is a measure of estimation error. However, it can also
be interpreted as the unconditional out-of-sample prediction error (predictive risk) associated
with the estimator $\hat\beta$ (Breiman and Freedman, 1983; Leeb, 2009; Stein, 1960).

2.2. Dense signals, sparse signals, and ellipsoids 

Let $B(c) = B_d(c) = \{\beta \in \mathbb{R}^d;\ \|\beta\| \le c\}$ denote the $\ell^2$-ball of radius $c > 0$. Though a
given signal $\beta \in \mathbb{R}^d$ is often considered to be dense if it has many nonzero entries, when
studying broader properties of dense signals and dense estimators it is common to consider
minimax problems over highly symmetric, convex (or loss-convex (Donoho and Johnstone,
1994)) parameter spaces. Following this approach, one of the primary quantities that we
use as a benchmark for evaluating estimators and determining performance limits in dense
estimation problems is the minimax risk over $B(c)$:

$$ R^{(b)}(c) = R^{(b)}_{d,n}(c) = \inf_{\hat\beta}\ \sup_{\beta \in B(c)} R(\hat\beta, \beta). \tag{4} $$

The infimum on the right-hand side in (4) is taken over all measurable estimators $\hat\beta$ and the
superscript "b" in $R^{(b)}(c)$ indicates that the relevant parameter space is the $\ell^2$-ball.

A basic consequence of the results in this paper is $R^{(b)}(c) \asymp d/n$. Thus, one must have
$d/n \to 0$ in order to ensure consistent estimation over $B(c)$. This is a well-known feature of
dense estimation problems and, as mentioned in Section 1, contrasts with many results on
sparse estimation that imply $\beta$ may be consistently estimated when $d/n \to \infty$. However, the
sparsity conditions on $\beta$ that are required for these results may not hold in general and our
motivating interest lies precisely in such situations. In this paper we derive sharp asymptotics
for $R^{(b)}(c)$ and related quantities in settings where $d/n \to 0$, $d/n \to \rho \in (0, \infty)$, and $d/n \to \infty$
(we assume that $d \to \infty$ throughout). Though consistent estimation is only guaranteed when
$d/n \to 0$, there are important situations where one might hope to analyze high-dimensional
datasets with $d/n$ substantially larger than 0, even if there is little reason to believe that
sparsity assumptions are valid. The results in this paper provide detailed information that
may be useful in situations like these.

In addition to sparse estimation problems, minimax rates faster than $d/n$ have also been
obtained for minimax problems over $\ell^2$-ellipsoids, which have been studied extensively in
situations similar to those considered here (Cavalier and Tsybakov, 2002; Goldenshluger and Tsybakov,
2001, 2003; Pinsker, 1980). Much of this work has been motivated by problems in
nonparametric function estimation. The results in this paper are related to many of these existing
results; however, there are important differences - both in their implications and the
techniques used to prove them. Goldenshluger and Tsybakov's (2001, 2003) work may be most
closely related to ours. Define the $\ell^2$-ellipsoid $B(c, \alpha) = \{\beta \in \mathbb{R}^d;\ \sum_{i=1}^d \alpha_i^2\beta_i^2 \le c^2\}$, with
$\alpha = (\alpha_1, \dots, \alpha_d)^T \in \mathbb{R}^d$, $0 < \alpha_1 \le \cdots \le \alpha_d$. Goldenshluger and Tsybakov studied minimax
problems over $\ell^2$-ellipsoids for a linear model with random predictors similar to the model
considered here (in fact, Goldenshluger and Tsybakov's results apply to infinite-dimensional
non-Gaussian $x_i$, though $x_i$ are required to have Gaussian tails and independent coordinates).
They identified asymptotically minimax estimators over $B(c, \alpha)$ and adaptive asymptotically
minimax estimators and showed that the minimax rate may be substantially faster than
$d/n$. However, their results also require the axes of $B(c, \alpha)$ to decay rapidly (i.e. $\alpha_d/c \to \infty$
quickly) and do not apply to $\ell^2$-balls $B(c) = B(c, (1, \dots, 1)^T)$ unless $d/n \to 0$. Though these
decay conditions are natural for many inverse problems in nonparametric function estimation,
they drive the improved minimax rates obtained by Goldenshluger and Tsybakov and may
be overly restrictive in other settings, such as the genomics applications discussed in Section
1 above.

2.3. The sequence model 

Minimax problems over restricted parameter spaces have been studied extensively in the
context of the sequence model. In the sequence model, given an index set $J$,

$$ z_j = \theta_j + \delta_j, \quad j \in J, \tag{5} $$

are observed, $\theta = (\theta_j)_{j \in J}$ is an unknown parameter, and $\delta = (\delta_j)_{j \in J}$ is a random error. The
sequence model is extremely flexible, and many existing results about the Gaussian sequence
model (where the coordinates of $\delta$ are iid Gaussian random variables) have implications for
high-dimensional linear models (Cavalier and Tsybakov, 2002; Pinsker, 1980). However, these
results tend to apply in linear models where one conditions on the predictors, as opposed to
random predictor models like the one considered here.

In order to prove the main result in this paper (Theorem 1), we study a sequence model
with non-Gaussian orthogonally invariant errors that is equivalent to the linear model (1).
Goldenshluger and Tsybakov (2001) also studied a non-Gaussian sequence model that derives
from a high-dimensional linear model with random predictors, but their results have
limitations in settings where $d/n \to \rho > 0$, as discussed in Section 2.2 above. In our analysis,
orthogonal invariance is heavily exploited to obtain precise results in any asymptotic setting
where $d \to \infty$. This appears to be a key difference between our analysis and Goldenshluger
and Tsybakov's.

2.4. Minimax problems over $\ell^2$-spheres and orthogonal equivariance

Define the $\ell^2$-sphere of radius $c$, $S(c) = S_d(c) = \{\beta \in \mathbb{R}^d;\ \|\beta\| = c\}$. Though it is common
in dense estimation problems to study the minimax risk over $\ell^2$-balls $R^{(b)}(c)$, which is one of
the primary objects of study here, we find it convenient and informative to consider a closely
related quantity, the minimax risk over $S(c)$,

$$ R^{(s)}(c) = R^{(s)}_{d,n}(c) = \inf_{\hat\beta}\ \sup_{\beta \in S(c)} R(\hat\beta, \beta) $$

(the superscript "s" in $R^{(s)}(c)$ stands for "sphere"). For our purposes, the primary significance
of considering $\ell^2$-spheres comes from connections with orthogonal invariance and equivariance.
Let $O(d)$ denote the group of $d \times d$ orthogonal matrices.

Definition 1. An estimator $\hat\beta = \hat\beta(y, X)$ for $\beta$ is orthogonally equivariant if

$$ U^T\hat\beta(y, X) = \hat\beta(y, XU) \tag{6} $$

for all $U \in O(d)$. $\Box$

Orthogonally equivariant estimators are compatible with orthogonal transformations of the 
predictor basis. They may be appropriate when there is little information carried in the given 
predictor basis vis-a-vis the outcome; by contrast, knowledge about sparsity is exactly one 
such piece of information. Indeed, sparsity assumptions generally imply that in the given basis 
some predictors are significantly more influential than others. Sparse estimators attempt to 
take advantage of this to improve performance and are typically not orthogonally equivariant. 

The concept of equivariance plays an important role in statistical decision theory (e.g.
(Berger, 1985), Chapter 6). However, it seems to have received relatively little attention in the
context of linear models. Significant aspects of equivariance include: (i) in certain cases, one
can show that it suffices to consider equivariant estimators when studying minimax problems
and (ii) equivariance may provide a convenient means for identifying minimax estimators. This
is basically the content of the Hunt-Stein theorem and both of these features prevail in the
present circumstances. To make this more precise, define the class of equivariant estimators

$$ \mathcal{E} = \mathcal{E}(n, d) = \{\hat\beta;\ \hat\beta \text{ is an orthogonally equivariant estimator for } \beta\} $$

and define

$$ R^{(e)}(\beta) = R^{(e)}_{d,n}(\beta) = \inf_{\hat\beta \in \mathcal{E}} R(\hat\beta, \beta). $$

Additionally, let $\pi_c$ denote the uniform measure on $S(c)$ and let

$$ \hat\beta_{unif}(c) = \hat\beta_{unif}(y, X; c) = E_{\pi_c}(\beta \mid y, X) $$

be the posterior mean of $\beta$ under the assumption that $\beta \sim \pi_c$ is independent of $(\epsilon, X)$. Since,
for $U \in O(d)$,

$$ U^T\hat\beta_{unif}(y, X; c) = E_{\pi_c}(U^T\beta \mid y, X) = E_{\pi_c}(\beta \mid y, XU) = \hat\beta_{unif}(y, XU; c), $$

it follows that $\hat\beta_{unif}(c) \in \mathcal{E}$. The next result follows directly from the Hunt-Stein theorem
and its proof is omitted.

Proposition 1. Suppose that $\|\beta\| = c$. Then

$$ R^{(s)}(c) = R^{(e)}(\beta) = R\{\hat\beta_{unif}(c), \beta\}. \tag{7} $$

Furthermore, if $\hat\beta \in \mathcal{E}$, then $R(\hat\beta, \beta)$ depends on $\beta$ only through $c$.

In a sense, Proposition 1 completely solves the minimax problem over $S(c)$. On the other
hand, the minimax estimator $\hat\beta_{unif}(c)$ is challenging to compute and it is desirable to identify
good estimators that have a simpler form. Moreover, though $\hat\beta_{unif}(c)$ solves the minimax
problem over $S(c)$, it is unclear how $R^{(s)}(c)$ relates to the minimax risk over $\ell^2$-balls, which is
a more commonly studied quantity in dense estimation problems. Finally, the minimax
estimator $\hat\beta_{unif}(c)$ depends on $c = \|\beta\|$, which is typically unknown in practice. All of these issues
must be addressed in order to identify practical estimators that perform well in dense problems
for high-dimensional linear models. This is accomplished below, where we show: (i) a linear
estimator (ridge regression) is asymptotically equivalent to $\hat\beta_{unif}(c)$, (ii) $R^{(b)}(c) \sim R^{(s)}(c)$ (i.e.
$R^{(b)}(c)/R^{(s)}(c) \to 1$), and (iii) under certain conditions $c = \|\beta\|$ may be effectively estimated.
Similar results have been obtained for the Gaussian sequence model with iid errors (Beran,
1996; Marchand, 1993). Our results rely on an inequality of Marchand's (Proposition 11 below)
and extend Marchand's and Beran's results to linear models with Gaussian predictors.

Proposition 1 and the related discussion imply that equivariant estimators have certain nice
properties and are closely linked with dense estimation problems. On the other hand, the next
result describes some of the limitations of orthogonally equivariant estimators when $d > n$
and is indicative of some of the challenges inherent in dense estimation problems beyond the
consistency requirement $d/n \to 0$.

Lemma 1. Suppose that $\hat\beta = \hat\beta(y, X) \in \mathcal{E}$. Then $\hat\beta$ is orthogonal to the null-space of $X$.

Proof. Suppose that $\mathrm{rank}(X) = r < d$ and let $X = UDV^T$ be the singular value
decomposition of $X$, where $U \in O(n)$, $V \in O(d)$,

$$ D = \begin{pmatrix} D_0 & 0 \\ 0 & 0 \end{pmatrix} $$

is an $n \times d$ matrix, and $D_0$ is an $r \times r$ diagonal matrix with rank $r$. Let $V_0$ denote the first
$r$ columns of $V$ and let $V_1$ denote the remaining $d - r$ columns of $V$. Finally, suppose that
$W_1 \in O(d - r)$ and let

$$ W = \begin{pmatrix} I_r & 0 \\ 0 & W_1 \end{pmatrix} \in O(d). $$

Then the null space of $X$ is equal to the column space of $V_1$ and it suffices to show that
$V_1^T\hat\beta = 0$. By equivariance,

$$ \hat\beta(y, X) = VW\hat\beta(y, XVW) = VW\hat\beta(y, UD). \tag{8} $$

Thus,

$$ V_1^T\hat\beta(y, X) = V_1^TVW\hat\beta(y, UD) = (0 \ \ W_1)\,\hat\beta(y, UD). \tag{9} $$

Since $\hat\beta(y, UD)$ does not depend on $W_1$ and (9) holds for all $W_1 \in O(d - r)$, it follows that
$V_1^T\hat\beta = 0$, as was to be shown. $\Box$

Lemma 1 is a non-estimability result for orthogonally equivariant estimators. It will be used
in Sections 3.3 and 6 below.

2.5. Linear estimators: Ridge regression 

Linear estimators play an important role in dense estimation problems in many statistical
settings. Fundamental references include (James and Stein, 1961; Pinsker, 1980; Stein, 1955).
Pinsker (1980) showed that under certain conditions, linear estimators in the Gaussian
sequence model are asymptotically minimax over $\ell^2$-ellipsoids. In the linear model, linear
estimators have the form $\hat\beta = Ay$, where $A$ is a data-dependent $d \times n$ matrix, and they are
convenient because of their simplicity. Define the ridge regression estimator

$$ \hat\beta_r(c) = (X^TX + d/c^2\, I)^{-1}X^Ty, \quad c \in [0, \infty]. $$

By convention, we take $\hat\beta_r(0) = 0$ and $\hat\beta_r(\infty) = \hat\beta_{ols} = (X^TX)^{-1}X^Ty$ to be the ordinary least
squares (OLS) estimator. Furthermore, throughout the paper, if a matrix $A$ is not invertible,
then $A^{-1}$ is taken to be its Moore-Penrose pseudoinverse (thus, the OLS estimator is defined
for all $d$, $n$). Clearly, $\hat\beta_r(c)$ is a linear estimator. Furthermore, it is easy to check that $\hat\beta_r(c) \in \mathcal{E}$.
Dicker (2012) studied finite sample and asymptotic properties of $R\{\hat\beta_r(c), \beta\}$. Some of these
properties will be used in this paper and are summarized presently.
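
The equivariance of ridge regression and the conclusion of Lemma 1 can be checked numerically. The sketch below is illustrative only: the helper names `ridge` and `random_orthogonal` are hypothetical, the random orthogonal matrix is obtained from a QR factorization (any orthogonal matrix would do), and the case $c = \infty$ is handled with NumPy's pseudoinverse, in line with the pseudoinverse convention above.

```python
import numpy as np


def ridge(y, X, c):
    """Ridge estimator (X^T X + (d / c^2) I)^{-1} X^T y, with c = 0 and c = inf as limits."""
    d = X.shape[1]
    if c == 0:
        return np.zeros(d)
    if np.isinf(c):
        # pinv(X) y equals (X^T X)^+ X^T y, the minimum-norm least squares solution
        return np.linalg.pinv(X) @ y
    return np.linalg.solve(X.T @ X + (d / c**2) * np.eye(d), X.T @ y)


def random_orthogonal(d, rng):
    """An orthogonal d x d matrix, taken as the Q factor of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, c = 5, 8, 2.0                      # d > n, so X has a nontrivial null space
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    U = random_orthogonal(d, rng)

    # Equivariance: U^T beta_hat(y, X) should equal beta_hat(y, X U)
    print(np.allclose(U.T @ ridge(y, X, c), ridge(y, X @ U, c)))

    # Lemma 1: the estimate has no component in the null space of X
    P_null = np.eye(d) - np.linalg.pinv(X) @ X
    print(np.linalg.norm(P_null @ ridge(y, X, c)))
```

Both checks rest on the identity $(X^TX + \lambda I)^{-1}X^T = X^T(XX^T + \lambda I)^{-1}$, which places the ridge estimate in the row space of $X$.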



2.5.1. Oracle estimators 

Define the oracle ridge regression estimator

$$ \hat\beta_r^* = \hat\beta_r(\|\beta\|). $$

This estimator is called an oracle estimator because it depends on $\|\beta\|$, which is typically
unknown. Proposition 5 of (Dicker, 2012) implies

$$ R(\hat\beta_r^*, \beta) = \inf_{c \in [0, \infty]} R\{\hat\beta_r(c), \beta\} = E\,\mathrm{tr}\,(X^TX + d/\|\beta\|^2\, I)^{-1} \tag{10} $$

and, furthermore,

$$ R\{\hat\beta_r(\|\beta\|), \beta_0\} \le R\{\hat\beta_r(\|\beta\|), \beta\}, \quad \text{if } \|\beta_0\| \le \|\beta\|. \tag{11} $$

The next result gives an expression for the asymptotic predictive risk of $\hat\beta_r^*$. Its proof relies
heavily on properties of the Marcenko-Pastur distribution (Bai, 1993; Marcenko and Pastur,
1967).

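Proposition 2 (Proposition 8 from (Dicker, 2012)). Suppose that $0 < \rho_- \le d/n \le \rho_+ < \infty$
for some fixed constants $\rho_-, \rho_+ \in \mathbb{R}$ and define

$$ r_{>0}(\rho, c) = \frac{1}{2\rho}\left[c^2(\rho - 1) - \rho + \sqrt{\{c^2(\rho - 1) - \rho\}^2 + 4c^2\rho^2}\right]. $$

(a) If $0 < \rho_- < \rho_+ < 1$ or $1 < \rho_- < \rho_+ < \infty$ and $|n - d| > 5$, then

$$ \left|R(\hat\beta_r^*, \beta) - r_{>0}(d/n, \|\beta\|)\right| = O\left(\frac{\|\beta\|^2}{\|\beta\|^2 + 1}\, n^{-1/4}\right). $$

(b) If $0 < \rho_- < 1 < \rho_+ < \infty$, then

$$ \left|R(\hat\beta_r^*, \beta) - r_{>0}(d/n, \|\beta\|)\right| = O\left(\|\beta\|^2 n^{-5/48}\right). $$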
Notice that Proposition 2 implies the asymptotic predictive risk of $\hat\beta_r^*$ is non-vanishing if
$d/n \to \rho > 0$. The main results in this paper are essentially asymptotic optimality results for
$\hat\beta_r^*$. In particular, we show that $\hat\beta_r^*$ is asymptotically minimax over $\ell^2$-balls and $\ell^2$-spheres,
and asymptotically optimal among the class of orthogonally equivariant estimators. Combined
with Propositions 2-3, these results immediately yield sharp asymptotics for $R^{(b)}(c)$, $R^{(s)}(c)$,
and $R^{(e)}(\beta)$.
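
A rough numerical comparison between the trace expression in (10) and the limiting risk of Proposition 2 can be sketched as follows. The formula coded in `r_pos` is $r_{>0}(\rho, c) = \{c^2(\rho-1) - \rho + \sqrt{(c^2(\rho-1) - \rho)^2 + 4c^2\rho^2}\}/(2\rho)$; the function names, the Monte Carlo settings, and the dimensions are assumptions chosen only for illustration.

```python
import numpy as np


def r_pos(rho, c):
    """Limiting oracle ridge risk r_{>0}(rho, c)."""
    a = c**2 * (rho - 1.0) - rho
    return (a + np.sqrt(a**2 + 4.0 * c**2 * rho**2)) / (2.0 * rho)


def oracle_ridge_risk_mc(n, d, c, reps=50, seed=0):
    """Monte Carlo estimate of E tr (X^T X + (d / c^2) I)^{-1} for Gaussian X."""
    rng = np.random.default_rng(seed)
    lam = d / c**2
    total = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, d))
        eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of X^T X
        total += np.sum(1.0 / (eigvals + lam))  # trace of the regularized inverse
    return total / reps


if __name__ == "__main__":
    n, d, c = 200, 100, 1.0
    print("Monte Carlo:", oracle_ridge_risk_mc(n, d, c))
    print("r_pos      :", r_pos(d / n, c))
```

For $\rho = 1/2$ and $c = 1$ the closed form reduces to $\sqrt{2} - 1$, which gives a convenient sanity check on the implementation.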

Taking a Bayesian point-of-view, our optimality results for $\hat\beta_r^*$ are not surprising. Indeed, in
Section 2.4 we observed that if $\|\beta\| = c$, then $\hat\beta_{unif}(c) = E_{\pi_c}(\beta \mid y, X)$ is minimax over $S(c)$ and
is optimal among orthogonally equivariant estimators for $\beta$. On the other hand, if $\|\beta\| = c$,
then the oracle ridge estimator $\hat\beta_r^* = \hat\beta_r(c) = E_{N(0, c^2/d\, I)}(\beta \mid y, X)$ may be interpreted as the
posterior mean of $\beta$ under the assumption that $\beta \sim N\{0, (c^2/d)I\}$ is independent of $\epsilon$ and $X$.
Furthermore, if $d$ is large, then the normal distribution $N\{0, (c^2/d)I\}$ is "close" to $\pi_c$ (there
is an enormous body of literature that makes this idea more precise - Diaconis and Freedman
(1987) attribute early work to Borel (1914) and Levy (1922)). Thus, it is reasonable to
expect that $\hat\beta_{unif}(c) \approx \hat\beta_r(c)$ and that, asymptotically, the oracle ridge estimator shares the
optimality properties of $\hat\beta_{unif}(c)$, which is indeed the case.

2.5.2. Adaptive estimators 

Adaptive ridge estimators will also be discussed in this paper. As mentioned above, $\|\beta\|$ is
typically unknown; hence, $\hat\beta_r^*$ is typically non-implementable. However, $\hat\beta_r^*$ may be
approximated by an adaptive estimator where $\|\beta\|$ is replaced with an estimate - this estimator
"adapts" to the unknown quantity $\|\beta\|$. Define

$$ \widehat{\|\beta\|}^2 = \max\left\{\frac{\|y\|^2}{n} - 1,\ 0\right\} $$

and define the adaptive ridge estimator

$$ \check\beta_r^* = \hat\beta_r(\widehat{\|\beta\|}). \tag{12} $$

Note that $\widehat{\|\beta\|}^2$ is a consistent estimator of $\|\beta\|^2$, as $n \to \infty$.
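
A minimal sketch of the adaptive construction follows, assuming the norm estimate takes the form $\max\{\|y\|^2/n - 1, 0\}$ for $\|\beta\|^2$ (motivated by $E\|y\|^2/n = \|\beta\|^2 + 1$ under (1)-(2)). The function names and the simulation settings are hypothetical and serve only as an illustration.

```python
import numpy as np


def norm_sq_estimate(y):
    """Estimate ||beta||^2 by max(||y||^2 / n - 1, 0); uses E ||y||^2 / n = ||beta||^2 + 1."""
    n = y.shape[0]
    return max(float(y @ y) / n - 1.0, 0.0)


def adaptive_ridge(y, X):
    """Ridge estimator with the unknown ||beta||^2 replaced by its estimate."""
    d = X.shape[1]
    c2 = norm_sq_estimate(y)
    if c2 == 0.0:
        return np.zeros(d)            # matches the convention beta_hat_r(0) = 0
    return np.linalg.solve(X.T @ X + (d / c2) * np.eye(d), X.T @ y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 50
    beta = np.full(d, 2.0 / np.sqrt(d))          # dense signal with ||beta||^2 = 4
    X = rng.standard_normal((n, d))
    y = X @ beta + rng.standard_normal(n)
    print("estimated ||beta||^2:", norm_sq_estimate(y))
    print("squared error:", np.sum((adaptive_ridge(y, X) - beta) ** 2))
```

The truncation at zero keeps the ridge penalty $d/\widehat{\|\beta\|}^2$ well defined; when the estimate is exactly zero the sketch returns the zero vector.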

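Proposition 3. Suppose that $0 < \rho_- \le d/n \le \rho_+ < 1$ for some fixed constants $\rho_-, \rho_+ \in \mathbb{R}$.
If $n - d > 5$, then

$$ \left|R(\check\beta_r^*, \beta) - R(\hat\beta_r^*, \beta)\right| = O\left(n^{-1/2}\right). $$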



L. Dicker/Dense Signals and High- Dimensional Linear Models 11 

The proof of Proposition 3 is nearly identical to the proof of Proposition 10 from (Dicker,
2012) and is omitted. Proposition 3 implies that if $d/n \to \rho \in (0, 1)$, then the adaptive
ridge estimator has nearly the same asymptotic risk as the oracle ridge estimator. Note the
restriction $d/n < 1$ in Proposition 3. This restriction also appears in (Dicker, 2012), where
$\mathrm{Var}(\epsilon_i) = \sigma^2$ is unknown and the signal-to-noise ratio $\|\beta\|^2/\sigma^2$, as opposed to $\|\beta\|^2$, is the
quantity that must be estimated to obtain an adaptive ridge estimator; in this context, $d/n < 1$
is a fairly natural condition for estimating $\sigma^2$. It is possible to extend Proposition 3 to settings
where $d/n > 1$. However, if $d/n > 1$, then the corresponding error term in Proposition 3 is
no longer uniformly bounded in $\|\beta\|^2$. Additionally, notice that Proposition 3 does not apply
to settings where $d/n \to 0$. A more careful analysis may lead to extensions in this direction
as well. Since adaptive estimation is not the main focus of this article, these issues are not
pursued further here; however, future research into these issues may prove interesting.

2.6. Outline of the paper 

The main results of the paper are stated in Section 3. Most of these results follow from
Theorem 1, which is stated at the beginning of the section. The remainder of the paper is
devoted to proving Theorem 1. In Section 4, the equivalence between the linear model and the
sequence model is formalized. The first part of Theorem 1, which applies to the setting where
$d \le n$, is proved in Section 5. This part of the proof involves converting error bounds for
the Gaussian sequence model with iid errors into useful bounds for the relevant non-Gaussian
sequence model. The second part of Theorem 1 ($d > n$) is proved in Section 6. When $d > n$,
$X^TX$ does not have full rank. The major steps in the proof for $d > n$ involve reducing the
problem to a full rank problem.

3. Main results 

The results in this section are presented in terms of the linear model. However, most have 
equivalent formulations in terms of the sequence model introduced in Section 4 below. 

Theorem 1. Suppose that $n > 2$ and let $s_1 \ge \cdots \ge s_{d \wedge n} \ge 0$ denote the nonzero (with
probability 1) eigenvalues of $(X^TX)^{-1}$.

(a) If $d \le n$, then

R01 f3) - R^^\(3) 




X^X ^ 



m\' 
(b) If $d > n$, then



R0l,(3)-R^^\f3) 



1 rn Ml . /^ .. ..T , d 



^ u^ ^'^ i-^^" - m' 



d — n 1 ^ ( ^^^^T d '\ 



From (10) and Proposition 1, it is clear that $R(\hat\beta_r^*, \beta)$ and $R^{(e)}(\beta)$ are finite. Moreover, basic
properties of the Wishart and inverse Wishart distributions imply that the upper bounds in
Theorem 1 are finite, provided $|n - d| > 1$; when $|n - d| \le 1$, these bounds are infinite. However,

if \n — d\ < 1, then the inequahties Rd^niPr^P) — Rd,n-i0r^ P) ^^^ R^l^{j3) < Rj^l_^{f3) may 
be combined with Theorem 1 (b) to obtain nontrivial bounds. 

In what remains of this section, we discuss some of the consequences of Theorem 1 and
related results in three asymptotic settings: $d/n \to 0$ (with $d \to \infty$, as well), $d/n \to \rho \in
(0, \infty)$, and $d/n \to \infty$.

3.1. $d/n \to 0$

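Proposition 4. Define

$$ r_0(\rho, c) = \frac{c^2\rho}{c^2 + \rho}. $$

If $d/n \to 0$ and $d \to \infty$, then

$$ R(\hat\beta_r^*, \beta) \sim R^{(e)}(\beta) \sim R^{(b)}(\|\beta\|) \sim R^{(s)}(\|\beta\|) \sim r_0(d/n, \|\beta\|) $$

uniformly for $\beta \in \mathbb{R}^d$.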

Proof. If $d + 1 < n$, then (10) and Jensen's inequality imply that

$$ \frac{d/n}{1 + d/(n\|\beta\|^2)} \le R(\hat\beta_r^*, \beta) \le \frac{d/n}{1 - (d+1)/n + d/(n\|\beta\|^2)}. $$

It follows that $R(\hat\beta_r^*, \beta) \sim r_0(d/n, \|\beta\|)$. By Theorem 1, in order to prove

$$ R^{(e)}(\beta) \sim r_0(d/n, \|\beta\|), \tag{14} $$

it suffices to show that 

^E l^tr {X^X + d/m\'l)-'^ = o{ro{d/n,m\)}- 
But this is clear:

= o|iro(d/n,||/3||) 

= o{ro(rf/n,||/3||)}, 

where we have used the facts E{s\) = 0{n~^) and E{s~^ ) = 0{n'') (Lemma A2, (Dicker, 
2012)). Thus, (14) holds. Since $R^{(s)}(\|\beta\|) = R^{(e)}(\beta)$, all that is left to prove is
$R^{(b)}(\|\beta\|) \sim R^{(s)}(\|\beta\|)$. This follows because

$$ R^{(s)}(\|\beta\|) \le R^{(b)}(\|\beta\|) \le R(\hat\beta_r^*, \beta) \sim R^{(s)}(\|\beta\|), \tag{16} $$

where we have used (11) to obtain the second inequality. $\Box$

The asymptotic risk $r_0(\rho, c)$ appears frequently in the analysis of linear estimators for
the Gaussian sequence model (Pinsker, 1980) and is often referred to as the "linear minimax
risk." The condition $d \to \infty$ in Proposition 4 is important because it drives the approximation
$\pi_c \approx N(0, c^2/d\, I)$, which enables us to conclude $R^{(e)}(\beta) \sim R(\hat\beta_r^*, \beta)$ (re: the discussion at the
end of Section 2.5.1). Notice that $\lim_{d/n \to 0} r_0(d/n, c) = 0$. Thus, the minimax risk vanishes when
$d/n \to 0$.
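
The relationship between the two limiting risks can be illustrated numerically. The sketch below evaluates $r_0(\rho, c) = c^2\rho/(c^2 + \rho)$ and the Proposition 2 quantity $r_{>0}(\rho, c)$ for decreasing $\rho$; the function names and the grid of values are assumptions made for illustration.

```python
import numpy as np


def r0(rho, c):
    """Linear minimax risk r_0(rho, c) = c^2 rho / (c^2 + rho)."""
    return c**2 * rho / (c**2 + rho)


def r_pos(rho, c):
    """Limiting oracle ridge risk r_{>0}(rho, c)."""
    a = c**2 * (rho - 1.0) - rho
    return (a + np.sqrt(a**2 + 4.0 * c**2 * rho**2)) / (2.0 * rho)


if __name__ == "__main__":
    c = 1.5
    for rho in [0.5, 0.1, 0.01, 0.001]:
        # the ratio approaches 1 and both risks approach 0 as rho decreases
        print(rho, r0(rho, c), r_pos(rho, c), r_pos(rho, c) / r0(rho, c))
```

For small $\rho$ the two expressions agree to first order, which is consistent with the spectral distribution of $n^{-1}X^TX$ becoming trivial when $d/n \to 0$; for $\rho$ of order one they separate.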

Proposition 4 implies that the ridge estimator $\hat\beta_r^*$ is asymptotically minimax if $d/n \to 0$ and
$d \to \infty$. On the other hand, other simple linear estimators are also asymptotically minimax
in this setting. Define the estimator

$$ \hat\beta_{scal}^* = \frac{1 - (d+1)/n}{1 - (d+1)/n + d/(n\|\beta\|^2)}\, \hat\beta_{ols}. $$

Note that $\hat\beta_{scal}^*$ is a scalar multiple of the OLS estimator and that $\hat\beta_{scal}^*$ is defined for all
$d$, $n$ since $\hat\beta_{ols}$ is defined using pseudoinverses. Various versions of $\hat\beta_{scal}^*$ have been studied
previously (Baranchik, 1973; Brown, 1990; Stein, 1960). Dicker (2012) showed that if $d + 1 < n$,
then

$$ R(\hat\beta_{scal}^*, \beta) = \frac{d/n}{1 - (d+1)/n + d/(n\|\beta\|^2)} < R(\hat\beta_{ols}, \beta) = \frac{d/n}{1 - (d+1)/n}. \tag{17} $$
The following corollary to Proposition 4 follows immediately. 
Corollary 1. (a) If $d/n \to 0$ and $d \to \infty$, then

$$ R(\hat\beta_{scal}^*, \beta) \sim R^{(b)}(\|\beta\|) $$

uniformly for $\beta \in \mathbb{R}^d$.
(b) If $d/n \to 0$, $d \to \infty$, and $d/(n\|\beta\|^2) \to s \ge 0$, then

$$ \frac{R(\hat\beta_{ols}, \beta)}{R^{(b)}(\|\beta\|)} \to 1 + s. $$

In other words, if $d/n \to 0$ and $d \to \infty$, then $\hat\beta_{scal}^*$ is asymptotically minimax over $\ell^2$-balls
(and, moreover, asymptotically equivalent to $\hat\beta_r^*$). Furthermore, the OLS estimator may be
asymptotically minimax over $\ell^2$-balls, but this depends on the magnitude of the signal $\beta$: if
$\|\beta\|^2$ is large, then the OLS estimator is asymptotically minimax; if $\|\beta\|^2$ is small, then it is
not.

3.2. $d/n \to \rho \in (0, \infty)$

The setting where $d/n \to \rho \in (0, \infty)$ may be the most interesting one for the dense
estimation problems considered here. The minimax risk is non-vanishing in this setting; however,
informative closed form expressions for the limiting minimax risk are available. Moreover,
differences between the linear estimators $\hat\beta_{scal}^*$ and $\hat\beta_r^*$ which are insignificant when $d/n \to 0$
become pronounced when $d/n \to \rho \in (0, \infty)$. These differences are largely attributable to the
spectral distribution of $n^{-1}X^TX$, which is asymptotically trivial if $d/n \to 0$ and converges to
the Marcenko-Pastur distribution (Marcenko and Pastur, 1967) if $d/n \to \rho \in (0, \infty)$.

Proposition 5. Suppose that $\rho \in (0, \infty)$ and let $R^*(\beta)$ denote any of $R(\hat\beta_r^*, \beta)$, $R^{(e)}(\beta)$,
$R^{(b)}(\|\beta\|)$, or $R^{(s)}(\|\beta\|)$. If $\rho \neq 1$, then

$$ \lim_{d/n \to \rho}\ \sup_{\beta \in \mathbb{R}^d} \left|R^*(\beta) - r_{>0}(d/n, \|\beta\|)\right| = 0, \tag{18} $$

where $r_{>0}(\rho, c)$ is defined in Proposition 2 above. Furthermore, as $d/n \to \rho$,

$$ R(\hat\beta_r^*, \beta) \sim R^{(e)}(\beta) \sim R^{(b)}(\|\beta\|) \sim R^{(s)}(\|\beta\|) \sim r_{>0}(d/n, \|\beta\|). \tag{19} $$

If $\rho \neq 1$, then the implied convergence in (19) holds uniformly for $\beta \in \mathbb{R}^d$; if $\rho = 1$, then the
convergence is uniform over $B(c)$ for any fixed $c \in (0, \infty)$.

Proof. Proposition 2 implies that $|R(\hat\beta_r^*, \beta) - r_{>0}(d/n, \|\beta\|)| \to 0$ and $R(\hat\beta_r^*, \beta) \sim r_{>0}(d/n, \|\beta\|)$,
with the appropriate uniformity conditions when $\rho \neq 1$ or $\rho = 1$. For $\rho < 1$, the asymptotic
equivalences $|R^{(e)}(\beta) - R(\hat\beta_r^*, \beta)| \to 0$ and $R^{(e)}(\beta) \sim R(\hat\beta_r^*, \beta)$ follow from (13) and (15); to
prove the equivalences for $\rho > 1$, notice that



n ysn J 



+2,f-" Etr(XX^ + d/c-I)-^ ^\<m? + 1) 
c^(n — 2) 



II/3IP 



Since $R^{(e)}(\beta) = R^{(s)}(\|\beta\|)$, it suffices to show that

$$ \lim_{d/n \to \rho}\ \sup_{\beta \in \mathbb{R}^d} \left|R^{(b)}(\|\beta\|) - R^{(s)}(\|\beta\|)\right| = 0 $$

and that $R^{(b)}(\|\beta\|) \sim R^{(s)}(\|\beta\|)$ uniformly for $\beta \in \mathbb{R}^d$ in order to prove the proposition; both
follow from (16). $\Box$

Two types of asymptotic equivalence are addressed in Proposition 5: differences (18) and quotients (19). The equivalence (18) is more informative for large $\|\beta\|$; (19) is more informative for small $\|\beta\|$. Notice that for fixed $\|\beta\| = c \in (0, \infty)$, $\lim_{d/n \to \rho} r_{>0}(d/n, c) = r_{>0}(\rho, c) > 0$ and it follows that (18) and (19) are equivalent.

For $d/n \to 0$, we saw that $\hat\beta_{scal}$ and $\hat\beta_r$ were asymptotically equivalent (and that, in some instances, both were also asymptotically equivalent to the OLS estimator; Corollary 1). When $d/n \to \rho \in (0, \infty)$, $\hat\beta_r^*$ and $\hat\beta_{scal}^*$ are not asymptotically equivalent. Indeed, (17) implies that
for d/n — )■ 0, we have 

R0lai,P)-rsUd/n,\m), 
where 

( ^ - i~P 

1 - p + p/c^ 
One easily checks that for $\rho > 0$, $r_{>0}(\rho, c) \le r_{scal}(\rho, c)$ with equality if and only if $c = 0$. Thus, $\hat\beta_{scal}$ is not asymptotically minimax over $\ell^2$-balls when $d/n \to \rho \in (0, \infty)$.

Despite its suboptimal performance, the estimator $\hat\beta_{scal}$ may be useful in certain situations. Indeed, if $\mathrm{Cov}(x_j) = \Sigma \ne I$, then it is straightforward to implement a modified version of $\hat\beta_{scal}$ with similar properties (replace $\|\beta\|^2$ in $\hat\beta_{scal}$ with $\beta^T\Sigma\beta$); on the other hand, if $\Sigma$ is unknown and a norm-consistent estimator for $\Sigma$ is not available, then this may have a more dramatic effect on the ridge estimator $\hat\beta_r$. This is discussed in detail in (Dicker, 2012), where it is argued that in dense problems where little is known about $\mathrm{Cov}(x_j)$, an appropriately modified version of $\hat\beta_{scal}$ is a reasonable alternative to ridge regression (note, for instance, that $R(\hat\beta_{scal}^*, \beta)/R(\hat\beta_r^*, \beta) = O(1)$ if $d/n \to \rho \in (0, \infty)$).
3.3. $d/n \to \infty$

Theorem 1 plays a crucial role in our asymptotic analysis when $d/n \to \rho < \infty$. It is less relevant in the setting where $d/n \to \infty$. Instead, Lemma 1 from Section 2.4 plays the key role. We have the following proposition.

Proposition 6. Suppose that $d > n$ and that $\hat\beta \in \mathcal{E}$. Then

$$R(\hat\beta, \beta) \ge \frac{d-n}{d}\|\beta\|^2.$$

Proof. Let $X = UDV^T$ be the singular value decomposition of $X$, as in the proof of Lemma 1. Let $V_0$ and $V_1$ be the first $r$ and the remaining $d - r$ columns of $V$, respectively, where $r = \mathrm{rank}(X)$ (note that $r = n$ with probability 1). Then

$$\begin{aligned}
R(\hat\beta, \beta) &= E\|\hat\beta - \beta\|^2 \\
&= E\|V_0^T(\hat\beta - \beta)\|^2 + E\|V_1^T\beta\|^2 && (20) \\
&\ge \frac{d-n}{d}\|\beta\|^2, && (21)
\end{aligned}$$

where (20) follows from Lemma 1 and (21) follows from symmetry. □
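
The symmetry step can be checked by simulation. The sketch below (assumptions: NumPy, a hypothetical helper `null_space_energy`, and arbitrary small values of $n$ and $d$) estimates $E\|V_1^T\beta\|^2$ by Monte Carlo for $X$ with iid $N(0,1)$ entries; the estimate should be close to $\|\beta\|^2(d-n)/d$.

```python
import numpy as np

def null_space_energy(n, d, beta, reps=2000, seed=0):
    """Monte Carlo estimate of E||V_1^T beta||^2, where the columns of V_1 span
    the null space of an n x d matrix X with iid N(0, 1) entries (d > n)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, d))
        # Full SVD: the rows of Vt beyond the first n span the null space of X.
        _, _, Vt = np.linalg.svd(X, full_matrices=True)
        total += np.sum((Vt[n:] @ beta) ** 2)
    return total / reps

n, d = 5, 20
beta = np.ones(d)
estimate = null_space_energy(n, d, beta)
target = (d - n) / d * (beta @ beta)
print(estimate, target)
```

The agreement reflects the fact that, by orthogonal invariance of the Gaussian design, the null space of $X$ is a uniformly distributed $(d-n)$-dimensional subspace of $\mathbb{R}^d$.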

The proof of Proposition 6 essentially implies that for $d > n$, the squared bias of an equivariant estimator must be at least $\|\beta\|^2(d-n)/d$. This highlights one of the major challenges in high-dimensional dense estimation problems, especially in settings where $d \gg n$. The next proposition, which is the main result in this subsection, implies that if $d/n \to \infty$, then the trivial estimator $\hat\beta_{null} = 0$ is asymptotically minimax. In a sense, this means that in dense problems $\beta$ is completely non-estimable when $d/n \to \infty$.

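Proposition 7. Let $\hat\beta_{null} = 0$. Then $R(\hat\beta_{null}, \beta) = \|\beta\|^2$. Furthermore, if $d/n \to \infty$, then

$$R(\hat\beta_r^*, \beta) \sim R^{(e)}(\beta) \sim R^{(s)}(\|\beta\|) \sim R^{(b)}(\|\beta\|) \sim R(\hat\beta_{null}, \beta) = \|\beta\|^2$$

uniformly for $\beta \in \mathbb{R}^d$.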

Proof. Clearly, $R(\hat\beta_{null}, \beta) = \|\beta\|^2$. It follows from Proposition 6 that for $d > n$,

$$\frac{d-n}{d}\|\beta\|^2 \le R^{(e)}(\beta) = R^{(s)}(\|\beta\|) \le R^{(b)}(\|\beta\|) \le R(\hat\beta_r^*, \beta) \le R(\hat\beta_{null}, \beta) = \|\beta\|^2.$$

The proposition follows by dividing by $\|\beta\|^2$ and taking $d/n \to \infty$. □
3.4. Adaptive estimators

The results in Sections 3.1-3.3 imply that the oracle ridge estimator $\hat\beta_r^* = \hat\beta_r(\|\beta\|)$ is asymptotically minimax over $\ell^2$-balls and $\ell^2$-spheres and is asymptotically optimal among equivariant estimators for $\beta$ in any asymptotic setting where $d \to \infty$. The next result describes asymptotic optimality properties of the adaptive ridge estimator $\check\beta_r^*$ (defined in (12)), which does not depend on $\|\beta\|$.

Proposition 8. Suppose that $\rho \in (0, 1)$ and let $R^*(\beta)$ denote any of $R(\hat\beta_r^*, \beta)$, $R^{(e)}(\beta)$, $R^{(s)}(\|\beta\|)$, $R^{(b)}(\|\beta\|)$, or $r_{>0}(d/n, \|\beta\|)$. Let $\{a_n\}_{n=1}^{\infty} \subseteq \mathbb{R}$ denote a sequence of positive real numbers such that $a_n n^{1/2} \to \infty$. Then



lim sup 

d/n—^p 



R*{(3)-R{$1,(3) 



R{$;,(3) 

U and hm sup = 1. 



Proposition 8 follows immediately from Propositions 3 and 5. The restriction $\|\beta\|^2 \gg n^{-1/2}$ in the second part of Proposition 8 is related to the fact that for $d/n \to \rho \in (0, \infty)$, $R(\hat\beta_r^*, \beta) = O(\|\beta\|^2)$ and the error bound in Proposition 3 is $O(n^{-1/2})$. As discussed in Section 2.5.2, more detailed results on adaptive ridge estimators are likely possible (that may apply, for instance, in settings where $d/n \to 0$ or $d/n \to \rho > 1$), but this is not pursued further here.

4. An equivalent sequence model 

The rest of the paper is devoted to proving Theorem 1. In this section and Section 5, we assume that $d \le n$. In Section 6, we address the case where $d > n$. The major goal in this section is to relate the linear model (1) to an equivalent non-Gaussian sequence model.

4.1. The model 

Let $\Sigma$ be a random orthogonally invariant $m \times m$ positive definite matrix with rank $m$, almost surely (by orthogonally invariant, we mean that $\Sigma$ and $U\Sigma U^T$ have the same distribution for any $U \in O(m)$). Additionally, let $\delta_0 \sim N(0, I_m)$ be an $m$-dimensional Gaussian random vector that is independent of $\Sigma$. Recall that in the sequence model (5), the vector $z = (z_j)_{j \in J} = \theta + \delta$ is observed and $J$ is an index set. In the formulation considered here, $J = \{1, \ldots, m\}$, $\delta = \Sigma^{1/2}\delta_0$, and $\Sigma$ is observed along with $z$. Thus, the available data are $(z, \Sigma)$ and

$$z = \theta + \delta = \theta + \Sigma^{1/2}\delta_0 \in \mathbb{R}^m. \tag{22}$$
Notice that $\delta$ is in general non-Gaussian. However, conditional on $\Sigma$, $\delta$ is a Gaussian random vector with covariance $\Sigma$. We are interested in the risk for estimating $\theta$ under squared error loss. For an estimator $\hat\theta = \hat\theta(z, \Sigma)$, this is defined by

$$\tilde R(\hat\theta, \theta) = E_\theta\|\hat\theta(z, \Sigma) - \theta\|^2 = E_\theta\|\hat\theta - \theta\|^2,$$

where the expectation is taken with respect to $\delta_0$ and $\Sigma$ (we use "~", as in $\tilde R$, to denote quantities related to the sequence model, as opposed to the linear model).

4.2. Equivariance and optimality concepts

Most of the key concepts initially introduced in the context of the linear model have analogues 
in the sequence model (22). In this subsection, we describe some that will be used in our proof 
of Theorem 1. 

Definition 2. Let $\hat\theta = \hat\theta(z, \Sigma)$ be an estimator for $\theta$. Then $\hat\theta$ is an orthogonally equivariant estimator for $\theta$ if

$$U\hat\theta(z, \Sigma) = \hat\theta(Uz, U\Sigma U^T)$$

for all $U \in O(m)$. □

Let

$$\tilde{\mathcal{E}} = \{\hat\theta : \hat\theta \text{ is an orthogonally equivariant estimator for } \theta\}$$

denote the class of orthogonally equivariant estimators for $\theta$. Also define the posterior mean for $\theta$ under the assumption that $\theta \sim \pi_c$,

$$\hat\theta_{unif}(c) = E_{\pi_c}(\theta \mid z, \Sigma),$$

and the posterior mean under the assumption that $\theta \sim N(0, (c^2/m)I)$,

$$\hat\theta_r(c) = E_{N(0,(c^2/m)I)}(\theta \mid z, \Sigma) = \frac{c^2}{m}\left(\Sigma + \frac{c^2}{m}I\right)^{-1}z$$

(for both of these Bayes estimators we assume that $\theta$ is independent of $\delta_0$ and $\Sigma$). The estimators $\hat\theta_{unif}(c)$ and $\hat\theta_r(c)$ for $\theta$ are analogous to the estimators $\hat\beta_{unif}(c)$ and $\hat\beta_r(c)$ for $\beta$, respectively. Moreover, they are both orthogonally equivariant, i.e. $\hat\theta_{unif}(c), \hat\theta_r(c) \in \tilde{\mathcal{E}}$, and $\hat\theta_r(c)$ is a linear estimator. Now define the minimal equivariant risk for the sequence model

$$\tilde R^{(e)}(\theta) = \inf_{\hat\theta \in \tilde{\mathcal{E}}} \tilde R(\hat\theta, \theta)$$

and the minimax risk over the $\ell^2$-sphere of radius $c$,

$$\tilde R^{(s)}(c) = \inf_{\hat\theta} \sup_{\theta \in S(c)} \tilde R(\hat\theta, \theta),$$

where the infimum above is taken over all measurable estimators $\hat\theta = \hat\theta(z, \Sigma)$. The Hunt-Stein theorem yields the following result, which is entirely analogous to Proposition 1.

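Proposition 9. Suppose that $\|\theta\| = c$. Then

$$\tilde R^{(s)}(c) = \tilde R^{(e)}(\theta) = \tilde R\{\hat\theta_{unif}(c), \theta\}.$$

Furthermore, if $\hat\theta \in \tilde{\mathcal{E}}$, then $\tilde R(\hat\theta, \theta)$ depends on $\theta$ only through $c$.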

4.3. Equivalence of the sequence model and the linear model

The next proposition helps characterize the equivalence between the linear model (1) and the 
sequence model (22). 

Proposition 10. Suppose that $d \le n$, $m = d$, and $\Sigma = (X^TX)^{-1}$.

(a) If $\beta = \theta$ and $z = (X^TX)^{-1}X^Ty$, then $\hat\beta_{unif}(c) = \hat\theta_{unif}(c)$, $\hat\beta_r(c) = \hat\theta_r(c)$, and

$$R\{\hat\beta_r(c), \beta\} = \tilde R\{\hat\theta_r(c), \theta\}.$$

(b) If $\|\theta\| = \|\beta\| = c$, then

$$R\{\hat\beta_{unif}(c), \beta\} = R^{(e)}(\beta) = R^{(s)}(c) = \tilde R^{(s)}(c) = \tilde R^{(e)}(\theta) = \tilde R\{\hat\theta_{unif}(c), \theta\}.$$

Part (a) of Proposition 10 is obvious; part (b) follows from the fact that $((X^TX)^{-1}X^Ty, (X^TX)^{-1})$ is sufficient for $\beta$ and the Rao-Blackwell inequality. Proposition 10 implies that it suffices to consider the sequence model in order to prove Theorem 1.
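
The ridge part of (a) can be checked numerically. The following sketch assumes the forms $\hat\beta_r(c) = (X^TX + (d/c^2)I)^{-1}X^Ty$ for the linear model and $\hat\theta_r(c) = (c^2/m)(\Sigma + (c^2/m)I)^{-1}z$ for the sequence model (each is the posterior mean under a $N(0, (c^2/d)I)$ prior with unit noise variance); the dimensions, seed, and function names are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def ridge_linear_model(X, y, c):
    """Ridge estimator (X^T X + (d / c^2) I)^{-1} X^T y in the linear model."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + (d / c**2) * np.eye(d), X.T @ y)

def ridge_sequence_model(z, Sigma, c):
    """Posterior mean (c^2 / m) (Sigma + (c^2 / m) I)^{-1} z in the sequence model."""
    m = len(z)
    return (c**2 / m) * np.linalg.solve(Sigma + (c**2 / m) * np.eye(m), z)

rng = np.random.default_rng(0)
n, d, c = 50, 10, 2.0
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
beta *= c / np.linalg.norm(beta)      # place beta on the sphere of radius c
y = X @ beta + rng.standard_normal(n)

Sigma = np.linalg.inv(X.T @ X)        # Sigma = (X^T X)^{-1}
z = Sigma @ X.T @ y                   # z is the OLS estimator

b_lin = ridge_linear_model(X, y, c)
b_seq = ridge_sequence_model(z, Sigma, c)
print(np.max(np.abs(b_lin - b_seq)))  # the two estimators agree up to rounding error
```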

5. Proof of Theorem 1 (a): Normal approximation for the uniform prior 

It follows from Proposition 9 that the Bayes estimator $\hat\theta_{unif}(c)$ is optimal among all orthogonally equivariant estimators for $\theta$, if $\|\theta\| = c$. In this section, we prove Theorem 1 (a) by bounding

$$|\tilde R\{\hat\theta_r(c), \theta\} - \tilde R\{\hat\theta_{unif}(c), \theta\}| \tag{23}$$

and applying Proposition 10.

Marchand (1993) studied the relationship between $\hat\theta_{unif}(c)$ and $\hat\theta_r(c)$ under the assumption that $\|\theta\| = c$ and $\Sigma = \tau^2 I$ (i.e. in the Gaussian sequence model with iid errors). Marchand proved the following result, which is one of the keys to the proof of Theorem 1 (a).

Proposition 11 (Theorem 3.1 from (Marchand, 1993)). Suppose that $\Sigma = \tau^2 I$ and $\|\theta\| = c$. Then



R{eric),o}-R{d^mfic),e} 



< 



1 c^r'^m 

-R{0r{c),0}. 

m 



Thus, in the Gaussian sequence model with iid errors, the risk of $\hat\theta_r(c)$ is nearly as small as that of $\hat\theta_{unif}(c)$. Marchand's result relies on somewhat delicate calculations involving modified Bessel functions (Robert, 1990). A direct approach to bounding (23) for general $\Sigma$ might involve attempting to mimic these calculations. However, this seems daunting (Bickel, 1981). Brown's identity, which relates the risk of a Bayes estimator to the Fisher information, allows us to sidestep these calculations and apply Marchand's result directly.

Define the Fisher information of a random vector $\xi \in \mathbb{R}^m$, with density $f_\xi$ (with respect to Lebesgue measure on $\mathbb{R}^m$) by

$$I(\xi) = \int_{\mathbb{R}^m} \frac{\nabla f_\xi(t)\,\nabla f_\xi(t)^T}{f_\xi(t)}\,dt,$$

where $\nabla f_\xi(t)$ is the gradient of $f_\xi(t)$. Brown's identity has typically been used for univariate problems or problems in the sequence model with iid Gaussian errors (Bickel, 1981; Brown and Gajek, 1990; Brown and Low, 1991; DasGupta, 2010). The next proposition is a straightforward generalization to the correlated multivariate Gaussian setting. Its proof is based on Stein's lemma.

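Proposition 12 (Brown's Identity). Suppose that $\mathrm{rank}(\Sigma) = m$, with probability 1. Let $I_\Sigma(\theta + \Sigma^{1/2}\delta_0)$ denote the Fisher information of $\theta + \Sigma^{1/2}\delta_0$, conditional on $\Sigma$, under the assumption that $\theta \sim \pi_c$ is independent of $\delta_0$ and $\Sigma$. If $\|\theta\| = c$, then

$$\tilde R\{\hat\theta_{unif}(c), \theta\} = E\,\mathrm{tr}(\Sigma) - E\,\mathrm{tr}\{\Sigma^2 I_\Sigma(\theta + \Sigma^{1/2}\delta_0)\}.$$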

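Proof. Suppose that $c = \|\theta\|$ and let

$$f(z) = \int_{S(c)} (2\pi)^{-m/2} \det(\Sigma^{-1/2})\, e^{-\frac{1}{2}(z-\theta)^T\Sigma^{-1}(z-\theta)}\, d\pi_c(\theta)$$

be the density of $z = \theta + \Sigma^{1/2}\delta_0$, conditional on $\Sigma$ and under the assumption that $\theta \sim \pi_c$. Then

$$\hat\theta_{unif}(c) = E_{\pi_c}(\theta \mid z, \Sigma) = z - E_{\pi_c}(\Sigma^{1/2}\delta_0 \mid z, \Sigma) = z + \frac{\Sigma\nabla f(z)}{f(z)}.$$

It follows that

$$\begin{aligned}
\tilde R\{\hat\theta_{unif}(c), \theta\} &= E\left\|\Sigma^{1/2}\delta_0 + \frac{\Sigma\nabla f(z)}{f(z)}\right\|^2 \\
&= E\,\mathrm{tr}(\Sigma) + 2E\left[\frac{\delta_0^T\Sigma^{3/2}\nabla f(z)}{f(z)}\right] + E\left\|\frac{\Sigma\nabla f(z)}{f(z)}\right\|^2 \\
&= E\,\mathrm{tr}(\Sigma) + 2E\left[\frac{\delta_0^T\Sigma^{3/2}\nabla f(z)}{f(z)}\right] + E\,\mathrm{tr}\{\Sigma^2 I_\Sigma(\theta + \Sigma^{1/2}\delta_0)\}. && (24)
\end{aligned}$$

By Stein's lemma (integration by parts),

$$E\left[\frac{\delta_0^T\Sigma^{3/2}\nabla f(z)}{f(z)}\right] = E\left[\mathrm{tr}\{\Sigma^2\nabla^2\log f(z)\}\right] = -E\,\mathrm{tr}\{\Sigma^2 I_\Sigma(\theta + \Sigma^{1/2}\delta_0)\}. \tag{25}$$

Brown's identity follows by combining (24) and (25). □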
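
A quick sanity check on the form of Brown's identity, under a simplifying assumption that is not the case treated in Proposition 12: if the uniform prior $\pi_c$ is replaced by the Gaussian prior $N(0, (c^2/m)I)$, then conditional on $\Sigma$ the marginal distribution of $z$ is $N(0, \Sigma + (c^2/m)I)$, whose Fisher information is $(\Sigma + (c^2/m)I)^{-1}$. In that case the right-hand side of the identity should reduce to the conditional Bayes risk $\mathrm{tr}(\Sigma^{-1} + (m/c^2)I)^{-1}$ of the linear estimator. The sketch below verifies this matrix identity for one arbitrarily generated positive definite $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, c = 6, 1.5
A = rng.standard_normal((m, m))
Sigma = A @ A.T + np.eye(m)          # an arbitrary positive definite matrix
I_m = np.eye(m)

# Fisher information of z ~ N(0, Sigma + (c^2/m) I) is the inverse covariance.
fisher = np.linalg.inv(Sigma + (c**2 / m) * I_m)

# Right-hand side of the identity, conditional on Sigma:
# tr(Sigma) - tr(Sigma^2 I(z)).
brown_rhs = np.trace(Sigma) - np.trace(Sigma @ Sigma @ fisher)

# Conditional Bayes risk of the linear (posterior mean) estimator.
bayes_risk = np.trace(np.linalg.inv(np.linalg.inv(Sigma) + (m / c**2) * I_m))

print(brown_rhs, bayes_risk)
```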

Using Brown's identity, Fisher information bounds may be converted to risk bounds, and vice-versa. Its usefulness in the present context springs from (i) the decomposition

$$z = \theta + \Sigma^{1/2}\delta_0 = \{\theta + (\gamma s_m)^{1/2}\delta_1\} + (\Sigma - \gamma s_m I)^{1/2}\delta_2, \tag{26}$$

where $\delta_1, \delta_2 \sim N(0, I_m)$ are independent of $\Sigma$, $s_m$ is the smallest eigenvalue of $\Sigma$, and $0 < \gamma < 1$ is a constant and (ii) Stam's inequality for the Fisher information of sums of independent random variables.
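
The decomposition (26) relies on $\Sigma - \gamma s_m I$ being positive definite for $0 < \gamma < 1$, so that it is a valid covariance matrix for the second Gaussian term and the two covariances add back to $\Sigma$. A minimal numerical sketch (the choice $\Sigma = (X^TX)^{-1}$, the value of $\gamma$, and the dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, gamma = 40, 8, 0.9
X = rng.standard_normal((n, m))
Sigma = np.linalg.inv(X.T @ X)

s = np.linalg.eigvalsh(Sigma)        # ascending order, so s[0] is s_m
s_m = s[0]

cov1 = gamma * s_m * np.eye(m)       # covariance of (gamma s_m)^{1/2} delta_1
cov2 = Sigma - cov1                  # covariance of the second Gaussian term

# cov2 has eigenvalues s_j - gamma * s_m, the smallest being (1 - gamma) * s_m > 0.
min_eig_cov2 = np.linalg.eigvalsh(cov2)[0]
print(min_eig_cov2, (1 - gamma) * s_m)
```

As $\gamma \uparrow 1$ the smallest eigenvalue of the second covariance tends to zero, which is why $\gamma$ is kept strictly below 1 until the final limiting step.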

Proposition 13 (Stam's inequality; this version due to Zamir (1998)). Let $v, w \in \mathbb{R}^m$ be independent random variables that are absolutely continuous with respect to Lebesgue measure on $\mathbb{R}^m$. For every $m \times m$ positive definite matrix $\Sigma$,

$$\mathrm{tr}\left[\Sigma^2 I(v + w)\right] \le \mathrm{tr}\left(\Sigma^2\left[I(v)^{-1} + I(w)^{-1}\right]^{-1}\right).$$

Notice that conditional on $\Sigma$, the term $\theta + (\gamma s_m)^{1/2}\delta_1$ in (26) may be viewed as an observation from the Gaussian sequence model with iid errors. The necessary bound on (23) is obtained by piecing together Brown's identity, the decomposition (26), and Stam's inequality, so that Marchand's inequality (Proposition 11) may be applied to $\theta + (\gamma s_m)^{1/2}\delta_1$.
Proposition 14. Suppose that $\Sigma$ has rank $m$ with probability 1 and that $\|\theta\| = c$. Let $s_1 \ge \cdots \ge s_m > 0$ denote the eigenvalues of $\Sigma$. Then

$$\left|\tilde R\{\hat\theta_r(c), \theta\} - \tilde R\{\hat\theta_{unif}(c), \theta\}\right| \le \frac{1}{m} E\left[\frac{s_1}{s_m}\,\mathrm{tr}\left(\Sigma^{-1} + \frac{m}{c^2}I\right)^{-1}\right].$$

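Proof. It is straightforward to check that

$$\tilde R\{\hat\theta_r(c), \theta\} = E\,\mathrm{tr}\left(\Sigma^{-1} + \frac{m}{c^2}I\right)^{-1}. \tag{27}$$

Thus, Brown's identity and (27) imply

$$\begin{aligned}
\tilde R\{\hat\theta_r(c), \theta\} - \tilde R\{\hat\theta_{unif}(c), \theta\} &= E\,\mathrm{tr}\{\Sigma^2 I_\Sigma(\theta + \delta)\} - E\,\mathrm{tr}(\Sigma) + E\,\mathrm{tr}\left(\Sigma^{-1} + \frac{m}{c^2}I\right)^{-1} \\
&= E\,\mathrm{tr}\{\Sigma^2 I_\Sigma(\theta + \delta)\} - E\,\mathrm{tr}\left\{\Sigma^2\left(\Sigma + \frac{c^2}{m}I\right)^{-1}\right\}.
\end{aligned}$$

Taking $v = \theta + (\gamma s_m)^{1/2}\delta_1$ and $w = (\Sigma - \gamma s_m I)^{1/2}\delta_2$ in Stam's inequality, where $\delta_1$, $\delta_2$, and $0 < \gamma < 1$ are given in (26), one obtains

$$\tilde R\{\hat\theta_r(c), \theta\} - \tilde R\{\hat\theta_{unif}(c), \theta\} \le E\,\mathrm{tr}\left(\Sigma^2\left[I_\Sigma\{\theta + (\gamma s_m)^{1/2}\delta_1\}^{-1} + \Sigma - \gamma s_m I\right]^{-1}\right) - E\,\mathrm{tr}\left\{\Sigma^2\left(\Sigma + \frac{c^2}{m}I\right)^{-1}\right\}.$$

By orthogonal invariance, $I_\Sigma\{\theta + (\gamma s_m)^{1/2}\delta_1\} = \zeta I_m$ for some $\zeta \ge 0$. Marchand's inequality, another application of Brown's identity, and (27) with $\Sigma = \gamma s_m I_m$ imply that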


c< 



ISmJ 'JSm + c'^/m ■ 



Since 



-:-ls^>[m- 1) — ^, 



it follows that 

R{^ric),O}-R{0ur^^fic),0} < Eti 



UUS + (m-l] 



ISmC-" 
Taking $\gamma \uparrow 1$,

R{0r{c),0] - R{0unif{c).0] < Etl 






m \Sm J 



The proposition follows because $\tilde R\{\hat\theta_{unif}(c), \theta\} \le \tilde R\{\hat\theta_r(c), \theta\}$. □

Theorem 1 (a) follows immediately from Propositions 10 and 14. 

6. Proof of Theorem 1 (b): d > n 

It only remains to prove Theorem 1 (b), which is achieved through a sequence of lemmas. The first step of the proof focuses on the linear model (as opposed to the sequence model) and on reducing the problem where $d > n$ and $X^TX$ is not invertible to a full rank problem. This step builds on Lemma 1 from Section 2.4.

Suppose that $d > n$ and let $X = UDV^T$ be the singular value decomposition of $X$, where $U \in O(n)$, $V \in O(d)$, $D = (D_0 \;\; 0)$, and $D_0$ is a rank $n$ diagonal matrix (with probability 1). Let $W \in O(n)$ be uniformly distributed on $O(n)$ (according to Haar measure) and independent of $\epsilon$ and $X$. Define the $n \times n$ matrix $X_0 = UD_0W^T$ and consider the full rank linear model

$$y_0 = X_0\beta_0 + \epsilon, \tag{28}$$

where $\beta_0 \in \mathbb{R}^n$. Notice that unlike $X$, the entries in $X_0$ are not iid $N(0, 1)$. However, $X_0^TX_0$ is orthogonally invariant. As with the linear model (1), one can consider estimators $\hat\beta_0 = \hat\beta_0(y_0, X_0)$ for $\beta_0$ and compute the risk

$$R_0(\hat\beta_0, \beta_0) = E_{\beta_0}\|\hat\beta_0 - \beta_0\|^2, \tag{29}$$

where the expectation in (29) is taken over $\epsilon$ and $X_0$. We have the following lemma.
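
The construction of $X_0$ can be sketched as follows. The Haar-distributed matrix $W$ is drawn here via the QR decomposition of a Gaussian matrix with a sign correction, which is one standard recipe and an implementation choice made for this illustration only; the function name `haar_orthogonal`, the seed, and the dimensions are hypothetical.

```python
import numpy as np

def haar_orthogonal(k, rng):
    """Draw a k x k orthogonal matrix from Haar measure via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((k, k)))
    # Multiply column j of Q by the sign of R[j, j] so the distribution is Haar.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(3)
n, d = 6, 15
X = rng.standard_normal((n, d))

# Thin SVD: U is n x n and sing holds the diagonal entries of D_0.
U, sing, _ = np.linalg.svd(X, full_matrices=False)
W = haar_orthogonal(n, rng)
X0 = U @ np.diag(sing) @ W.T         # X_0 = U D_0 W^T, an n x n matrix

# X_0 X_0^T = U D_0^2 U^T = X X^T, so X_0 and X share their singular values.
print(np.allclose(X0 @ X0.T, X @ X.T))
```

Because $X_0X_0^T = XX^T$, the nonzero spectrum of the original design is preserved exactly, which is the property used later when eigenvalues of the two models are compared.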

Lemma 2. Suppose that $d > n$, $\|\beta\| = c$, and $\hat\beta \in \mathcal{E}(n, d)$. Let $P_0$ denote any fixed $n \times d$ projection matrix with orthogonal rows. Then there is an orthogonally equivariant estimator $\hat\beta_0 \in \mathcal{E}(n, n)$ such that

$$R(\hat\beta, \beta) = \int_{S_d(c)} R_0(\hat\beta_0, P_0 b)\, d\pi_c(b) + \frac{d-n}{d}c^2.$$
Proof. As above, let $X = UDV^T$ be the singular value decomposition of $X$. Let $V_0$ denote the first $n$ columns of $V$ and let $V_1$ denote the remaining $d - n$ columns of $V$. By (8),

$$\hat\beta(y, X) = V_0\hat\beta_0(y, UD_0),$$

where $\hat\beta_0(y, UD_0)$ is the first $n$ coordinates of $\hat\beta(y, UD)$. Furthermore, it is easy to check that $\hat\beta_0$ is orthogonally equivariant, i.e. $\hat\beta_0 \in \mathcal{E}(n, n)$. Thus,

To prove the lemma, it suffices to show that 

By Proposition 1, orthogonal invariance of tTc, and orthogonal equivariance of /3q, 



= e\ I ||^o(f^^oKrb + e,t/Do) 
L Jsd{c) 

-ro^b||2rfvre(b) 
= e[ f \\^^^{UD^W^Poh + e,UD^) 

I JSi{c) 

-W^PohW^ d-K^ih) 
= [ E||^o(yo,^o)-Pob|prf7r,(b), 

as was to be shown. D 

Lemma 2 allows us to express the risk of an equivariant estimator for $\beta$ in the linear model (1) with $d > n$ in terms of the risk of another equivariant estimator in a different linear model (28) with $d = n$. Though the linear model (28) differs from the original linear model with Gaussian predictors (thus, Theorem 1 (a) does not apply directly), (28) is equivalent to the sequence model (22), with $m = n$ and $\Sigma = (X_0^TX_0)^{-1}$.
Lemma 3. Suppose that $2 < m = n < d$ and $\Sigma = (X_0^TX_0)^{-1}$ in the sequence model (22). Also suppose that $\|\beta\| = c$. Let $P_0$ be a fixed $n \times d$ projection matrix with orthogonal rows and let $s_1 \ge \cdots \ge s_n > 0$ denote the eigenvalues of $(XX^T)^{-1}$. Then



i?{^_.(c),/3} > 



> 



Sd(c) 

E 

Sd{c) 
+ 



R{e^nif{Pot),Pot}d7r,{t) + 



d — n 



d 



- ^\ tr (XX^ 
nsnj \ 



n 



I P +l|2 



/ 1 \ dTiAi] 



d — n 



> E 



d 



nSr 



n{d-2) 
c2(n-2)' 



d — n 
d 



c\ 



Proof. The first inequality follows from Lemma 2 and a suitably modified version of Proposition 10 that describes the equivalence between the linear model (28) and the sequence model (22). The second inequality follows from Proposition 14 and the fact that $X_0^TX_0$ and $XX^T$ have the same eigenvalues:

R{e^mf{Pot),Pot} > i?{6».(Pot),Pot} 

_i^ /£itr(XoXo^ + n/\\Pot\\^I)-'] 

n [Sn J 

''^tT{X^Xo + n/\\Pot\\'iy' 



E 
E 



nSr. 



Sl 



nSr. 



tr(XX^ + n/||Pot|r/) ' 



The last inequality in the lemma follows from Jensen's inequality and the identity

$$\int_{S_d(c)} \frac{1}{\|P_0 t\|^2}\, d\pi_c(t) = \frac{d-2}{c^2(n-2)}.$$
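
The moment identity behind this step can be checked by simulation. If $t$ is uniform on the sphere of radius $c$ in $\mathbb{R}^d$ and $P_0$ selects $n$ coordinates, then $\|P_0t\|^2/c^2$ has a Beta$(n/2, (d-n)/2)$ distribution, so $E\|P_0t\|^{-2} = (d-2)/\{c^2(n-2)\}$ for $n > 2$. The sketch below takes $P_0$ to be the projection onto the first $n$ coordinates (an assumption made for convenience; by symmetry any projection with orthonormal rows gives the same value) and uses arbitrary values of $n$, $d$, and $c$; the function name is hypothetical.

```python
import numpy as np

def mean_inverse_projected_norm(n, d, c, reps=20000, seed=4):
    """Monte Carlo estimate of E[1 / ||P_0 t||^2] for t uniform on the sphere
    of radius c in R^d, with P_0 keeping the first n coordinates."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((reps, d))
    t = c * g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    return np.mean(1.0 / np.sum(t[:, :n] ** 2, axis=1))

n, d, c = 10, 30, 2.0
estimate = mean_inverse_projected_norm(n, d, c)
exact = (d - 2) / (c**2 * (n - 2))
print(estimate, exact)
```

The requirement $n > 2$ in Lemma 3 is what keeps this inverse moment finite.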
We now have the tools to complete the proof of Theorem 1 (b). Suppose that $d > n$ and $\|\beta\| = c$. Then

$$R(\hat\beta_r^*, \beta) = E\,\mathrm{tr}\left(XX^T + \frac{d}{c^2}I\right)^{-1} + \frac{d-n}{d}c^2.$$
Since $R\{\hat\beta_r(c), \beta\} - R\{\hat\beta_{unif}(c), \beta\} = R\{\hat\beta_r(c), \beta\} - R^{(e)}(\beta) \ge 0$, Lemma 3 implies

Sl 



26 



-E 



nsr 



tr^XX^+^(^-2) 



c2(n-2)' 



< -E['-^ti{XX^ + d/cHy^ 

n \Sn 

+2 ,f~^^^ Etr(XX^ + d/cHr\ 
C^yn — 2) 

Theorem 1 (b) follows. 